Crowdsourcing High-Quality Parallel Data Extraction from Twitter

Authors

  • Wang Ling
  • Luís Marujo
  • Chris Dyer
  • Alan W. Black
  • Isabel Trancoso
Abstract

High-quality parallel data is crucial for a range of multilingual applications, from tuning and evaluating machine translation systems to cross-lingual annotation projection. Unfortunately, automatically obtained parallel data, which is available in relative abundance, tends to be quite noisy. To obtain high-quality parallel data, we introduce a crowdsourcing paradigm in which workers with only basic bilingual proficiency identify translations from an automatically extracted corpus of parallel microblog messages. For less than $350, we obtained over 5000 parallel segments in five language pairs. Evaluated against expert annotations, the quality of the crowdsourced corpus is significantly better than that of existing automatic methods: it achieves performance comparable to expert annotations when used in MERT tuning of a microblog MT system, and training a parallel sentence classifier with it also leads to improved results. The crowdsourced corpora will be made available at http://www.cs.cmu.edu/~lingwang/microtopia/.

Related resources

Automatic Keyword Extraction on Twitter

In this paper, we build a corpus of tweets from Twitter annotated with keywords using crowdsourcing methods. We identify key differences between this domain and the work performed on other domains, such as news, which cause existing approaches for automatic keyword extraction to generalize poorly to Twitter datasets. These datasets include the small amount of content in each tweet, the frequent ...

Crowdsourcing Ambiguity-Aware Ground Truth

The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, this assumption often creates is...

Exploring the Geographical Relations Between Social Media and Flood Phenomena to Improve Situational Awareness - A Study About the River Elbe Flood in June 2013

Recent research has shown that social media platforms like Twitter can provide relevant information to improve situation awareness during emergencies. Previous work has mostly concentrated on the classification and analysis of tweets utilizing crowdsourcing or machine learning techniques. However, managing the high volume and velocity of social media messages still remains challenging. In order ...

Raimond: Quantitative Data Extraction from Twitter to Describe Events

Social media play a decisive role in communicating and spreading information during global events. In particular, real-time microblogging platforms such as Twitter have become prevalent. Researchers have used microblogging for a number of tasks, including past events analysis, predictions, and information retrieval. Nevertheless, little attention has been given to quantitative data extraction. ...

Extraction of Pluvial Flood Relevant Volunteered Geographic Information (VGI) by Deep Learning from User Generated Texts and Photos

In recent years, pluvial floods caused by extreme rainfall events have occurred frequently. Especially in urban areas, they lead to serious damage and endanger citizens’ safety. Therefore, real-time information about such events is desirable. With the increasing popularity of social media platforms, such as Twitter or Instagram, information provided by voluntary users becomes a valuable so...

Publication date: 2014